TMDB - Movie Database Exploration

Featuring Movie Popularity Vs. Revenue

Introduction

The tmdb-movies dataset is composed of movie statistics. It includes data on movie budgets and revenues, along with information on cast members, directors and technical details such as runtime. My analysis focuses on whether higher popularity means higher revenue. The dependent variables I will use are budget, popularity and revenue. The independent variables are release year, release date and runtime. I also want to see whether longer runtimes mean lower popularity and/or lower revenue.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

%matplotlib inline

Data Wrangling

In [2]:
df = pd.read_csv('DataSets/tmdb-movies.csv')
df.head(3)
Out[2]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08

3 rows × 21 columns

I run df.head() to view the dataset. Some columns, such as homepage and tagline, may not be useful in my investigation, but I do not feel they need to be dropped because they will not interfere with my analysis. I prefer to keep these columns because questions come up during exploration and I may need them.

In [3]:
# How many rows & columns
df.shape
Out[3]:
(10866, 21)
In [4]:
# Explore data types to ensure columns have the correct types, e.g. strings are objects and numbers are ints or floats
df.dtypes
Out[4]:
id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object
In [5]:
# Explore further and notice the release_date is an object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB
In [6]:
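# Convert release_date from object to datetime
# Note: the raw dates use two-digit years (e.g. 6/9/15), so older releases can be parsed into the wrong century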
df['release_date'] = pd.to_datetime(df['release_date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   id                    10866 non-null  int64         
 1   imdb_id               10856 non-null  object        
 2   popularity            10866 non-null  float64       
 3   budget                10866 non-null  int64         
 4   revenue               10866 non-null  int64         
 5   original_title        10866 non-null  object        
 6   cast                  10790 non-null  object        
 7   homepage              2936 non-null   object        
 8   director              10822 non-null  object        
 9   tagline               8042 non-null   object        
 10  keywords              9373 non-null   object        
 11  overview              10862 non-null  object        
 12  runtime               10866 non-null  int64         
 13  genres                10843 non-null  object        
 14  production_companies  9836 non-null   object        
 15  release_date          10866 non-null  datetime64[ns]
 16  vote_count            10866 non-null  int64         
 17  vote_average          10866 non-null  float64       
 18  release_year          10866 non-null  int64         
 19  budget_adj            10866 non-null  float64       
 20  revenue_adj           10866 non-null  float64       
dtypes: datetime64[ns](1), float64(4), int64(6), object(10)
memory usage: 1.7+ MB
In [7]:
# Verify change to release_date
df.head(3)
Out[7]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
0 135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. ... Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 2015-06-09 5562 6.5 2015 1.379999e+08 1.392446e+09
1 76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. ... An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 2015-05-13 6185 7.1 2015 1.379999e+08 3.481613e+08
2 262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You ... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 2015-03-18 2480 6.3 2015 1.012000e+08 2.716190e+08

3 rows × 21 columns

In [8]:
# check unique data in each column, look for missing data
df.nunique()
Out[8]:
id                      10865
imdb_id                 10855
popularity              10814
budget                    557
revenue                  4702
original_title          10571
cast                    10719
homepage                 2896
director                 5067
tagline                  7997
keywords                 8804
overview                10847
runtime                   247
genres                   2039
production_companies     7445
release_date             5909
vote_count               1289
vote_average               72
release_year               56
budget_adj               2614
revenue_adj              4840
dtype: int64
In [9]:
# Check for null data: select every row that has at least one null value
null_data = df[df.isnull().any(axis=1)]
null_data
Out[9]:
id imdb_id popularity budget revenue original_title cast homepage director tagline ... overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
18 150689 tt1661199 5.556818 95000000 542351353 Cinderella Lily James|Cate Blanchett|Richard Madden|Helen... NaN Kenneth Branagh Midnight is just the beginning. ... When her father unexpectedly passes away, youn... 112 Romance|Fantasy|Family|Drama Walt Disney Pictures|Genre Films|Beagle Pug Fi... 2015-03-12 1495 6.8 2015 8.739996e+07 4.989630e+08
21 307081 tt1798684 5.337064 30000000 91709827 Southpaw Jake Gyllenhaal|Rachel McAdams|Forest Whitaker... NaN Antoine Fuqua Believe in Hope. ... Billy "The Great" Hope, the reigning junior mi... 123 Action|Drama Escape Artists|Riche-Ludwig Productions 2015-06-15 1386 7.3 2015 2.759999e+07 8.437300e+07
26 214756 tt2637276 4.564549 68000000 215863606 Ted 2 Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|... NaN Seth MacFarlane Ted is Coming, Again. ... Newlywed couple Ted and Tami-Lynn want to have... 115 Comedy Universal Pictures|Media Rights Capital|Fuzzy ... 2015-06-25 1666 6.3 2015 6.255997e+07 1.985944e+08
32 254470 tt2848292 3.877764 29000000 287506194 Pitch Perfect 2 Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br... NaN Elizabeth Banks We're back pitches ... The Bellas are back, and they are better than ... 115 Comedy|Music Universal Pictures|Gold Circle Films|Brownston... 2015-05-07 1264 6.8 2015 2.667999e+07 2.645056e+08
33 296098 tt3682448 3.648210 40000000 162610473 Bridge of Spies Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba... NaN Steven Spielberg In the shadow of war, one man showed the world... ... During the Cold War, the Soviet Union captures... 141 Thriller|Drama DreamWorks SKG|Amblin Entertainment|Studio Bab... 2015-10-15 1638 7.1 2015 3.679998e+07 1.496016e+08
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10861 21 tt0060371 0.080598 0 0 The Endless Summer Michael Hynson|Robert August|Lord 'Tally Ho' B... NaN Bruce Brown NaN ... The Endless Summer, by Bruce Brown, is one of ... 95 Documentary Bruce Brown Films 2066-06-15 11 7.4 1966 0.000000e+00 0.000000e+00
10862 20379 tt0060472 0.065543 0 0 Grand Prix James Garner|Eva Marie Saint|Yves Montand|Tosh... NaN John Frankenheimer Cinerama sweeps YOU into a drama of speed and ... ... Grand Prix driver Pete Aron is fired by his te... 176 Action|Adventure|Drama Cherokee Productions|Joel Productions|Douglas ... 2066-12-21 20 5.7 1966 0.000000e+00 0.000000e+00
10863 39768 tt0060161 0.065141 0 0 Beregis Avtomobilya Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z... NaN Eldar Ryazanov NaN ... An insurance agent who moonlights as a carthie... 94 Mystery|Comedy Mosfilm 2066-01-01 11 6.5 1966 0.000000e+00 0.000000e+00
10864 21449 tt0061177 0.064317 0 0 What's Up, Tiger Lily? Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... NaN Woody Allen WOODY ALLEN STRIKES BACK! ... In comic Woody Allen's film debut, he took the... 80 Action|Comedy Benedict Pictures Corp. 2066-11-02 22 5.4 1966 0.000000e+00 0.000000e+00
10865 22293 tt0060666 0.035919 19000 0 Manos: The Hands of Fate Harold P. Warren|Tom Neyman|John Reynolds|Dian... NaN Harold P. Warren It's Shocking! It's Beyond Your Imagination! ... A family gets lost on the road and stumbles up... 74 Horror Norm-Iris 2066-11-15 15 1.5 1966 1.276423e+05 0.000000e+00

8874 rows × 21 columns
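
The last rows of the output above show release dates in 2066 for movies whose release_year is 1966, so the two-digit years were parsed into the wrong century. Below is a minimal sketch of one way to repair this, assuming release_year is trustworthy. The sample frame and the names sample, parsed, wrong_century and fixed are mine for illustration; the raw strings for the 1966 rows are assumed from their parsed dates.

```python
import pandas as pd

# Small sample in the m/d/yy style of the raw CSV. The 2015 value is copied from
# Out[2]; the 1966 strings are assumed from the parsed dates shown above.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "6/15/66", "12/21/66"],
    "release_year": [2015, 1966, 1966],
})

# With %y, two-digit years 00-68 are read as 2000-2068, so "66" becomes 2066.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# Wherever the parsed year disagrees with release_year, shift back 100 years.
wrong_century = parsed.dt.year != sample["release_year"]
fixed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))

sample["release_date"] = fixed
print(sample)
```

The same steps could be applied to df before any analysis that relies on release_date. The questions below only use release_year, so they are not affected.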

In [10]:
# identify null values by columns
# Here the missing data is not consequential because it is all in object (text) columns
# homepage, for example, may never have existed or may have been taken down; it is not relevant to the questions we need to answer
# therefore I have determined there is no need to fill in null data in this dataset
null_columns = df.columns[df.isnull().any()]
df[null_columns].isnull().sum()
Out[10]:
imdb_id                   10
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
genres                    23
production_companies    1030
dtype: int64

Exploratory Data Analysis

In [11]:
df.describe()
Out[11]:
id popularity budget revenue runtime vote_count vote_average release_year budget_adj revenue_adj
count 10866.000000 10866.000000 1.086600e+04 1.086600e+04 10866.000000 10866.000000 10866.000000 10866.000000 1.086600e+04 1.086600e+04
mean 66064.177434 0.646441 1.462570e+07 3.982332e+07 102.070863 217.389748 5.974922 2001.322658 1.755104e+07 5.136436e+07
std 92130.136561 1.000185 3.091321e+07 1.170035e+08 31.381405 575.619058 0.935142 12.812941 3.430616e+07 1.446325e+08
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000 0.000000e+00 0.000000e+00
25% 10596.250000 0.207583 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000 0.000000e+00 0.000000e+00
50% 20669.000000 0.383856 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000 0.000000e+00 0.000000e+00
75% 75610.000000 0.713817 1.500000e+07 2.400000e+07 111.000000 145.750000 6.600000 2011.000000 2.085325e+07 3.369710e+07
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000 4.250000e+08 2.827124e+09
In [12]:
df.hist(figsize=(8, 8));

Research Question 1 - What was the most popular movie of 2015 and what revenue amount did it generate?

In [13]:
# Which 2015 movie was the most popular, and what revenue did it generate?
# release_year is an int64 column, so compare against the number 2015, not a string
ryear2015 = df.query('release_year == 2015')
df2015 = ryear2015[ryear2015['popularity'] == ryear2015['popularity'].max()]
print(df2015.loc[:,['original_title', 'popularity', 'revenue']])
   original_title  popularity     revenue
0  Jurassic World   32.985763  1513528810

“Jurassic World” was the most popular movie of 2015, but according to the query below it did not make the most revenue.

In [14]:
ryear2015top5 = ryear2015.nlargest(5, ['popularity'])
ryear2015top5.loc[:,['original_title', 'popularity', 'revenue']]
Out[14]:
original_title popularity revenue
0 Jurassic World 32.985763 1513528810
1 Mad Max: Fury Road 28.419936 378436354
2 Insurgent 13.112507 295238201
3 Star Wars: The Force Awakens 11.173104 2068178225
4 Furious 7 9.335014 1506249360
In [15]:
fig = px.bar(ryear2015top5, x='original_title', y='revenue', color='popularity', title='Top 5 Popular Movies of 2015', 
             hover_name="original_title")
fig.show()

Of the top 5 in popularity for 2015, “Star Wars: The Force Awakens” had the highest revenue.

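Also, note the bar chart is arranged by popularity, and we can quickly see that popularity rank does not track revenue. The second and third most popular titles, “Mad Max: Fury Road” and “Insurgent”, had the lowest revenues of the five, while the fourth most popular had the highest.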
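
To compare the two orderings directly, each title can be ranked on both measures. This is a small sketch using the five rows from Out[14]; the frame top5 and the rank column names are mine for illustration, and in the notebook the same calls could be applied to ryear2015top5.

```python
import pandas as pd

# The five rows shown in Out[14]
top5 = pd.DataFrame({
    "original_title": ["Jurassic World", "Mad Max: Fury Road", "Insurgent",
                       "Star Wars: The Force Awakens", "Furious 7"],
    "popularity": [32.985763, 28.419936, 13.112507, 11.173104, 9.335014],
    "revenue": [1513528810, 378436354, 295238201, 2068178225, 1506249360],
})

# Rank 1 = highest value on each measure
top5["popularity_rank"] = top5["popularity"].rank(ascending=False).astype(int)
top5["revenue_rank"] = top5["revenue"].rank(ascending=False).astype(int)

print(top5[["original_title", "popularity_rank", "revenue_rank"]])
```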

Research Question 2 - Does the runtime of a movie affect its popularity?

In [16]:
# Does runtime affect popularity?
# I also included the release_year as color, which is interesting 

fig = px.scatter(df, x='popularity', y='runtime', color='release_year', title='Runtime Vs. Popularity', hover_name="original_title")
fig.show()

The scatter plot above shows the relationship between runtime and popularity. Here we see that the longest movies tend to have the lowest popularity. Likewise, the most popular movies have shorter runtimes.
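
A scatter plot gives a visual impression only, so a correlation coefficient can put a number on the relationship. The sketch below uses thirteen (runtime, popularity) pairs copied from rows displayed earlier in this notebook. They are not a representative sample, so the printed values only illustrate the calls. The names sample, valid, pearson and spearman are mine.

```python
import pandas as pd

# (runtime, popularity) pairs copied from rows displayed earlier in this notebook.
# Illustration only; in the notebook the same calls would run on df.
sample = pd.DataFrame({
    "runtime": [124, 120, 119, 112, 123, 115, 115, 141, 95, 176, 94, 80, 74],
    "popularity": [32.985763, 28.419936, 13.112507, 5.556818, 5.337064,
                   4.564549, 3.877764, 3.648210, 0.080598, 0.065543,
                   0.065141, 0.064317, 0.035919],
})

# describe() reports a minimum runtime of 0, which likely marks missing data,
# so drop such rows before measuring the relationship.
valid = sample[sample["runtime"] > 0]

# Pearson measures linear association
pearson = valid["runtime"].corr(valid["popularity"])

# Pearson on the ranks is the Spearman rank correlation, which is less
# sensitive to a few extreme popularity values
spearman = valid["runtime"].rank().corr(valid["popularity"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

A value near zero on the full dataset would suggest that runtime says little about popularity, whatever the extremes of the plot look like.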

In [26]:
#df.plot(x='budget', y='revenue', kind='scatter');
# Let's see popularity plotted against revenue
df.plot(kind='scatter', x='popularity', y='revenue', color='purple', title='Popularity Vs. Revenue', figsize=(14, 8));

From the scatter plot above we see that the movie with the highest revenue is not the most popular one. There are examples of very popular movies that have generated large revenues, but the majority of movies cluster around a low popularity rating.

Research Question 3 - How Do the Most Popular Movies Compare to the Least Popular Movies?

In [137]:
# I first sought to find the mean of popularity
df['popularity'].mean()
Out[137]:
0.6464409519602426
In [138]:
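# Second, I separated popularity into two categories: least popular and most popular, using the mean as the cutoff.
# Use the computed mean rather than a hand-typed, truncated copy of it
popularity_mean = df['popularity'].mean()
leastpopular = df['popularity'] <= popularity_mean
mostpopular = df['popularity'] > popularity_mean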
In [139]:
df.revenue[mostpopular].hist(alpha=0.5, bins=2, label='Most Popular')
df.revenue[leastpopular].hist(alpha=0.5, bins=2, label='Least Popular')
plt.xlabel('Revenue')
plt.ylabel('Counts')
plt.title('Revenue of Most Popular Movies compared to Least Popular Movies')
plt.legend();

The above histogram uses 2 bins per group. The y axis shows the sample counts and the x axis shows revenue, with one color per popularity group. I split the samples at the mean popularity. The least popular group has more samples but collectively generated less revenue. The most popular group generated more revenue from fewer samples.

In [140]:
df.popularity[mostpopular]
Out[140]:
0        32.985763
1        28.419936
2        13.112507
3        11.173104
4         9.335014
           ...    
10763     0.745447
10820     1.227582
10821     0.929393
10822     0.670274
10833     0.737730
Name: popularity, Length: 3061, dtype: float64
In [141]:
df.popularity[leastpopular]
Out[141]:
215      0.644786
216      0.640151
217      0.638557
218      0.633608
219      0.629561
           ...   
10861    0.080598
10862    0.065543
10863    0.065141
10864    0.064317
10865    0.035919
Name: popularity, Length: 7805, dtype: float64
In [142]:
df.revenue[mostpopular].sum()
Out[142]:
372935640778
In [143]:
df.revenue[leastpopular].sum()
Out[143]:
59784552097

The four cells above show further queries. The first two print the popularity values of each group; note the Length of each (3,061 most popular and 7,805 least popular movies), which corresponds to the sample counts in the histogram. I also wanted to see the total revenue. The most popular movies total $372,935,640,778 (about $372.9 billion) compared to $59,784,552,097 (about $59.8 billion) for the least popular.
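
Because the two groups differ in size, totals alone can hide what a typical movie in each group earns. A per-movie mean and median alongside the sum gives a fairer comparison. The sketch below uses fifteen (popularity, revenue) pairs copied from rows displayed earlier in this notebook, so the numbers it prints are illustrative only. The names sample, cutoff, group and summary are mine.

```python
import numpy as np
import pandas as pd

# (popularity, revenue) pairs copied from rows displayed earlier in this notebook.
# Illustration only; in the notebook the same calls would run on df.
sample = pd.DataFrame({
    "popularity": [32.985763, 28.419936, 13.112507, 11.173104, 9.335014,
                   5.556818, 5.337064, 4.564549, 3.877764, 3.648210,
                   0.080598, 0.065543, 0.065141, 0.064317, 0.035919],
    "revenue": [1513528810, 378436354, 295238201, 2068178225, 1506249360,
                542351353, 91709827, 215863606, 287506194, 162610473,
                0, 0, 0, 0, 0],
})

# Full-dataset mean popularity reported in Out[137]
cutoff = 0.6464409519602426
sample["group"] = np.where(sample["popularity"] > cutoff,
                           "most popular", "least popular")

# Count, total, and per-movie averages for each group in one table
summary = sample.groupby("group")["revenue"].agg(["count", "sum", "mean", "median"])
print(summary)
```

Note that df.describe() shows a median revenue of 0 for the full dataset, so many rows record no revenue at all. Those zeros pull down any per-group average and may deserve separate handling.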

Conclusion

First, I investigated different groups to see whether popular movies generated larger revenues than movies that are low in popularity. I started with an isolated case from 2015, pulling out the data on the 5 most popular movies that year. The chart showed that some of the most popular movies had lower revenues than titles that were not as popular.

Next, I constructed scatter plots to compare runtime with popularity, and then revenue with popularity.

Finally, I computed the mean of popularity and split the data into two samples: least popular and most popular. The least popular group had more samples, but the most popular group collectively made greater revenue.

To conclude, you can find instances where a highly popular movie grosses less revenue than less popular movies. However, across the larger sample groups, the most popular movies generated far more total revenue.
